Search | BVS Regional Portal

1.

Deciphering cell types by integrating scATAC-seq data with genome sequences.

Zeng, Yuansong; Luo, Mai; Shangguan, Ningyuan; Shi, Peiyu; Feng, Junxi; Xu, Jin; Chen, Ken; Lu, Yutong; Yu, Weijiang; Yang, Yuedong.

Nat Comput Sci ; 2024 Apr 10.

Article in English | MEDLINE | ID: mdl-38600256

ABSTRACT

The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.

2.

Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting.

Ding, Maolin; Chen, Ken; Yang, Yuedong; Zhao, Huiying.

Hum Genet ; 2024 Apr 04.

Article in English | MEDLINE | ID: mdl-38575818

ABSTRACT

Genetic diseases are mostly associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in various ways. It remains essential but challenging to understand the functional consequences of these genetic variants, especially the noncoding ones, due to the lack of corresponding annotations. While many computational methods have been proposed to identify risk variants, most of them have curated only DNA-level and protein-level annotations to predict the pathogenicity of the variants, and others have been restricted to missense variants exclusively. In this study, we have curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions, where the features of protein sequences and protein structures have been shown to be essential for analyzing missense variants in coding regions, while the features related to RNA splicing and RBP binding are significant for variants in noncoding regions and synonymous variants in coding regions. Through the integration of these features, we have formulated the Multi-level feature Genomic Variants Predictor (ML-GVP) using the gradient boosting tree. The method has been trained on more than 400,000 variants in the Sherloc-training set from the 6th Critical Assessment of Genome Interpretation with superior performance. The method is one of the two best-performing predictors on the blind test in the Sherloc assessment, and is further confirmed by another independent test dataset of de novo variants.

3.

Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction.

Chen, Ken; Zhou, Yue; Ding, Maolin; Wang, Yu; Ren, Zhixiang; Yang, Yuedong.

Brief Bioinform ; 25(3), 2024 Mar 27.

Article in English | MEDLINE | ID: mdl-38605640

ABSTRACT

Language models pretrained by self-supervised learning (SSL) have been widely utilized to study protein sequences, while few models were developed for genomic sequences and were limited to single species. Due to the lack of genomes from different species, these models cannot effectively leverage evolutionary information. In this study, we have developed SpliceBERT, a language model pretrained on primary ribonucleic acid (RNA) sequences from 72 vertebrates by masked language modeling, and applied it to sequence-based modeling of RNA splicing. Pretraining SpliceBERT on diverse species enables effective identification of evolutionarily conserved elements. Meanwhile, the learned hidden states and attention weights can characterize the biological properties of splice sites. As a result, SpliceBERT was shown to be effective on several downstream tasks: zero-shot prediction of variant effects on splicing, prediction of branchpoints in humans, and cross-species prediction of splice sites. Our study highlighted the importance of pretraining genomic language models on a diverse range of species and suggested that SSL is a promising approach to enhance our understanding of the regulatory logic underlying genomic sequences.

Subjects

RNA Splicing , Vertebrates , Animals , Humans , Base Sequence , Vertebrates/genetics , RNA , Supervised Machine Learning
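
The masked language modeling objective named in the abstract above can be illustrated with a minimal sketch. It only shows how positions in a sequence are hidden and kept as prediction targets; it is not SpliceBERT's code. The mask symbol, the 15% masking rate (a common BERT-style convention), the RNA alphabet in the example and the function name `mask_sequence` are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "?"  # hypothetical single-character mask symbol (assumption)


def mask_sequence(seq, mask_rate=0.15, seed=0):
    """Hide a random subset of positions and return (masked_seq, targets).

    `targets` maps each masked position to the original nucleotide that a
    model would be trained to recover. The 15% rate is an assumed default.
    """
    if not seq:
        return seq, {}
    rng = random.Random(seed)
    # Mask at least one position, and never more than the sequence length.
    n_mask = min(len(seq), max(1, round(len(seq) * mask_rate)))
    positions = rng.sample(range(len(seq)), n_mask)
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return "".join(tokens), targets


# Toy RNA sequence; a real model would be trained to predict `targets`.
masked, targets = mask_sequence("ACGUACGUACGUACGUACGU")
print(masked)
print(targets)
```

A model trained this way sees only the masked string and is scored on how well it recovers the hidden nucleotides, which is why no labels are needed.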

4.

Genome-scale annotation of protein binding sites via language model and geometric deep learning.

Yuan, Qianmu; Tian, Chong; Yang, Yuedong.

Elife ; 13, 2024 Apr 17.

Article in English | MEDLINE | ID: mdl-38630609

ABSTRACT

Revealing protein binding sites with other molecules, such as nucleic acids, peptides, or small ligands, sheds light on disease mechanism elucidation and novel drug design. With the explosive growth of proteins in sequence databases, how to accurately and efficiently identify these binding sites from sequences becomes essential. However, current methods mostly rely on expensive multiple sequence alignments or experimental protein structures, limiting their genome-scale applications. Besides, these methods haven't fully explored the geometry of the protein structures. Here, we propose GPSite, a multi-task network for simultaneously predicting binding residues of DNA, RNA, peptide, protein, ATP, HEM, and metal ions on proteins. GPSite was trained on informative sequence embeddings and predicted structures from protein language models, while comprehensively extracting residual and relational geometric contexts in an end-to-end manner. Experiments demonstrate that GPSite substantially surpasses state-of-the-art sequence-based and structure-based approaches on various benchmark datasets, even when the structures are not well-predicted. The low computational cost of GPSite enables rapid genome-scale binding residue annotations for over 568,000 sequences, providing opportunities to unveil unexplored associations of binding sites with molecular functions, biological processes, and genetic variants. The GPSite webserver and annotation database can be freely accessed at https://bio-web1.nscc-gz.cn/app/GPSite.

Subjects

Deep Learning , Protein Binding , Proteins/metabolism , Binding Sites , Peptides/metabolism

5.

Applying image features of proximal paracancerous tissues in predicting prognosis of patients with hepatocellular carcinoma.

Lin, Siying; Yong, Juanjuan; Zhang, Lei; Chen, Xiaolong; Qiao, Liang; Pan, Weidong; Yang, Yuedong; Zhao, Huiying.

Comput Biol Med ; 173: 108365, 2024 May.

Article in English | MEDLINE | ID: mdl-38537563

ABSTRACT

BACKGROUND: Most of the methods using digital pathological images for predicting hepatocellular carcinoma (HCC) prognosis have not considered the paracancerous tissue microenvironment (PTME), which is potentially important for tumour initiation and metastasis. This study aimed to identify roles of image features of PTME in predicting prognosis and tumour recurrence of HCC patients. METHODS: We collected whole slide images (WSIs) of 146 HCC patients from Sun Yat-sen Memorial Hospital (SYSM dataset). For each WSI, five types of regions of interest (ROIs) in PTME and tumours were manually annotated. These ROIs were used to construct a Lasso Cox survival model for predicting the prognosis of HCC patients. To make the model broadly useful, we established a deep learning method to automatically segment WSIs, and further used it to construct a prognosis prediction model. This model was tested on samples from 225 HCC patients in the Cancer Genome Atlas Liver Hepatocellular Carcinoma (TCGA-LIHC) dataset. RESULTS: In predicting prognosis of the HCC patients, using the image features of manually annotated ROIs in PTME achieved a C-index of 0.668 in the SYSM testing dataset, which is higher than the C-index of 0.648 reached by the model using only image features of tumours. Integrating ROIs of PTME and tumours achieved a C-index of 0.693 in the SYSM testing dataset. The model using automatically segmented ROIs of PTME and tumours achieved a C-index of 0.665 (95% CI: 0.556-0.774) in the TCGA-LIHC samples, which is better than the widely used methods WSISA (0.567), DeepGraphSurv (0.593), and SeTranSurv (0.642). Finally, we found that the Texture SumAverage Skew HV on immune cell infiltration and Texture-related features on desmoplastic reaction are the most important features of PTME in predicting HCC prognosis. We additionally used the model in predicting HCC recurrence for patients from the SYSM-training, SYSM-testing, and TCGA-LIHC datasets, indicating the important roles of PTME in the prediction.
CONCLUSIONS: Our results indicate that image features of PTME are critical for improving the prognosis prediction of HCC. Moreover, the image features related to immune cell infiltration and desmoplastic reaction of PTME are the most important factors associated with prognosis of HCC.

Subjects

Carcinoma, Hepatocellular , Liver Neoplasms , Humans , Carcinoma, Hepatocellular/diagnostic imaging , Liver Neoplasms/diagnostic imaging , Hospitals , Tumor Microenvironment
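
The C-index values quoted in the abstract above measure how often a model ranks patients correctly by risk. The following is a minimal sketch of Harrell's concordance index for right-censored data, assuming higher risk scores should correspond to shorter survival. The function name and the toy data are illustrative and are not taken from the study.

```python
from itertools import combinations


def concordance_index(times, events, risks):
    """Harrell's C-index: share of comparable pairs ranked correctly by risk.

    times  -- follow-up times
    events -- 1 if the event was observed, 0 if the subject was censored
    risks  -- predicted risk scores (higher means worse expected outcome)
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject `a` has the shorter follow-up.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the times differ and the earlier
        # subject actually experienced the event.
        if times[a] == times[b] or not events[a]:
            continue
        comparable += 1
        if risks[a] > risks[b]:
            concordant += 1.0
        elif risks[a] == risks[b]:
            concordant += 0.5  # ties in risk count as half-concordant
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy example: four patients, the third one censored.
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.7, 0.8, 0.1]))  # 0.8
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which puts the reported range of roughly 0.57 to 0.69 in context.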

6.

Self-Supervised Contrastive Molecular Representation Learning with a Chemical Synthesis Knowledge Graph.

Xie, Jiancong; Wang, Yi; Rao, Jiahua; Zheng, Shuangjia; Yang, Yuedong.

J Chem Inf Model ; 64(6): 1945-1954, 2024 Mar 25.

Article in English | MEDLINE | ID: mdl-38484468

ABSTRACT

Self-supervised molecular representation learning has demonstrated great promise in bridging machine learning and chemical science to accelerate the development of new drugs. Due to the limited reaction data, existing methods are mostly pretrained by augmenting the intrinsic topology of molecules without effectively incorporating chemical reaction prior information, which makes them difficult to generalize to chemical reaction-related tasks. To address this issue, we propose ReaKE, a reaction knowledge embedding framework, which formulates chemical reactions as a knowledge graph. Specifically, we constructed a chemical synthesis knowledge graph with reactants and products as nodes and reaction rules as the edges. Based on the knowledge graph, we further proposed novel contrastive learning at both molecule and reaction levels to capture the reaction-related functional group information within and between molecules. Extensive experiments demonstrate the effectiveness of ReaKE compared with state-of-the-art methods on several downstream tasks, including reaction classification, product prediction, and yield prediction.

Subjects

Machine Learning , Pattern Recognition, Automated

7.

TCR signaling induces STAT3 phosphorylation to promote TH17 cell differentiation.

Qin, Zhen; Wang, Ruining; Hou, Ping; Zhang, Yuanyuan; Yuan, Qianmu; Wang, Ying; Yang, Yuedong; Xu, Tao.

J Exp Med ; 221(3), 2024 Mar 04.

Article in English | MEDLINE | ID: mdl-38324068

ABSTRACT

TH17 differentiation is critically controlled by "signal 3" of cytokines (IL-6/IL-23) through STAT3. However, cytokines alone induced only a moderate level of STAT3 phosphorylation. Surprisingly, TCR stimulation alone induced STAT3 phosphorylation through Lck/Fyn, and synergistically with IL-6/IL-23 induced robust and optimal STAT3 phosphorylation at Y705. Inhibition of Lck/Fyn kinase activity by Srci1 or disrupting the interaction between Lck/Fyn and STAT3 by disease-causing STAT3 mutations selectively impaired TCR stimulation, but not cytokine-induced STAT3 phosphorylation, which consequently abolished TH17 differentiation and converted them to FOXP3+ Treg cells. Srci1 administration or disrupting the interaction between Lck/Fyn and STAT3 significantly ameliorated TH17 cell-mediated EAE disease. These findings uncover an unexpected deterministic role of TCR signaling in fate determination between TH17 and Treg cells through Lck/Fyn-dependent phosphorylation of STAT3, which can be exploited to develop therapeutics selectively against TH17-related autoimmune diseases. Our study thus provides insight into how TCR signaling could integrate with cytokine signal to direct T cell differentiation.

Subjects

Encephalomyelitis, Autoimmune, Experimental , Receptors, Antigen, T-Cell , Th17 Cells , Cell Differentiation , Cytokines , Interleukin-23 , Interleukin-6 , Lymphocyte Specific Protein Tyrosine Kinase p56(lck) , Phosphorylation , Encephalomyelitis, Autoimmune, Experimental/immunology , Animals

8.

An uncertainty-based interpretable deep learning framework for predicting breast cancer outcome.

Chai, Hua; Lin, Siyin; Lin, Junqi; He, Minfan; Yang, Yuedong; OuYang, Yongzhong; Zhao, Huiying.

BMC Bioinformatics ; 25(1): 88, 2024 Feb 29.

Article in English | MEDLINE | ID: mdl-38418940

ABSTRACT

BACKGROUND: Predicting outcome of breast cancer is important for selecting appropriate treatments and prolonging the survival periods of patients. Recently, different deep learning-based methods have been carefully designed for cancer outcome prediction. However, the application of these methods is still challenged by interpretability. In this study, we proposed a novel multitask deep neural network called UISNet to predict the outcome of breast cancer. The UISNet is able to interpret the importance of features for the prediction model via an uncertainty-based integrated gradients algorithm. UISNet improved the prediction by introducing prior biological pathway knowledge and utilizing patient heterogeneity information. RESULTS: The model was tested in seven public datasets of breast cancer, and showed better performance (average C-index = 0.691) than the state-of-the-art methods (average C-index = 0.650, ranged from 0.619 to 0.677). Importantly, the UISNet identified 20 genes as associated with breast cancer, among which 11 have been proven to be associated with breast cancer by previous studies, and others are novel findings of this study. CONCLUSIONS: Our proposed method is accurate and robust in predicting breast cancer outcomes, and it is an effective way to identify breast cancer-associated genes. The method codes are available at: https://github.com/chh171/UISNet .

Subjects

Breast Neoplasms , Deep Learning , Humans , Female , Breast Neoplasms/genetics , Uncertainty , Neural Networks, Computer , Algorithms

9.

Predicting disease-gene associations through self-supervised mutual infomax graph convolution network.

Xie, Jiancong; Rao, Jiahua; Xie, Junjie; Zhao, Huiying; Yang, Yuedong.

Comput Biol Med ; 170: 108048, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38310804

ABSTRACT

Illuminating associations between diseases and genes can help reveal the pathogenesis of syndromes and contribute to treatments, but a large number of associations remain unexplored. To identify novel disease-gene associations, many computational methods have been developed using disease and gene-related prior knowledge. However, these methods still perform relatively poorly owing to limited external data sources and the inevitable noise in the prior knowledge. In this study, we have developed a new method, Self-Supervised Mutual Infomax Graph Convolution Network (MiGCN), to predict disease-gene associations under the guidance of external disease-disease and gene-gene collaborative graphs. Noise within the collaborative graphs was eliminated by maximizing the mutual information between nodes and neighbors through a graphical mutual infomax layer. In parallel, the node interactions were strengthened by a novel informative message passing layer to improve the learning ability of the graph neural network. Extensive experiments showed that our model improved on the state-of-the-art method by more than 8% in AUC. The datasets, source codes and trained models of MiGCN are available at https://github.com/biomed-AI/MiGCN.

Subjects

Learning , Neural Networks, Computer , Humans , Software , Syndrome

10.

GRELinker: A Graph-Based Generative Model for Molecular Linker Design with Reinforcement and Curriculum Learning.

Zhang, Hao; Huang, Jinchao; Xie, Junjie; Huang, Weifeng; Yang, Yuedong; Xu, Mingyuan; Lei, Jinping; Chen, Hongming.

J Chem Inf Model ; 64(3): 666-676, 2024 Feb 12.

Article in English | MEDLINE | ID: mdl-38241022

ABSTRACT

Fragment-based drug discovery (FBDD) is widely used in drug design. One useful strategy in FBDD is designing linkers for linking fragments to optimize their molecular properties. In the current study, we present a novel generative fragment linking model, GRELinker, which utilizes a gated-graph neural network combined with reinforcement and curriculum learning to generate molecules with desirable attributes. The model has been shown to be efficient in multiple tasks, including controlling log P, optimizing synthesizability or predicted bioactivity of compounds, and generating molecules with high 3D similarity but low 2D similarity to the lead compound. Specifically, our model outperforms the previously reported reinforcement learning (RL) built-in method DRlinker on these benchmark tasks. Moreover, GRELinker has been successfully used in an actual FBDD case to generate optimized molecules with enhanced affinities by employing the docking score as the scoring function in RL. Besides, the implementation of curriculum learning in our framework enables the generation of structurally complex linkers more efficiently. These results demonstrate the benefits and feasibility of GRELinker in linker design for molecular optimization and drug discovery.

Subjects

Drug Design , Drug Discovery , Neural Networks, Computer , Learning , Curriculum

11.

The influence of short-range molecular order in gelatinized starch on the formation of starch-lauric acid complexes.

Chao, Chen; Huang, Shiqing; Yu, Jinglin; Copeland, Les; Yang, Yuedong; Wang, Shujun.

Int J Biol Macromol ; 260(Pt 2): 129526, 2024 Mar.

Article in English | MEDLINE | ID: mdl-38242387

ABSTRACT

A model system of gelatinized wheat starch (GWS) and lauric acid (LA) was used to examine the effect of residual short-range molecular order in GWS on the formation of starch-lipid complexes. The extent of residual short-range molecular order, as determined by Raman spectroscopy, decreased with increasing water content or heating duration of gelatinization. The enthalpy changes, crystallinity, short-range molecular order and the in vitro enzymic digestion of GWS-LA complexes increased initially to a maximum and then declined as the short-range molecular order in GWS decreased, showing that there was an optimal amount of residual short-range molecular order in GWS for maximizing GWS-LA complexes formation. Below this optimum amount, the limited disruption of short-range molecular order may constrain the mobility of amylose chains for complexation with LA, whereas with excessive disruption above this amount the amylose chains may be too disorganized or entangled to form complexes with LA. The susceptibility of GWS-LA complexes to enzymatic hydrolysis was influenced by both long- and short-range structural order, and to a lesser extent the amounts of complexes. This study showed clearly the role of short-range molecular order in gelatinized starch in influencing the formation of GWS-LA complexes.

Subjects

Amylose , Starch , Starch/chemistry , Amylose/chemistry , Lauric Acids/chemistry , Hydrolysis

12.

DiffDec: Structure-Aware Scaffold Decoration with an End-to-End Diffusion Model.

Xie, Junjie; Chen, Sheng; Lei, Jinping; Yang, Yuedong.

J Chem Inf Model ; 64(7): 2554-2564, 2024 Apr 08.

Article in English | MEDLINE | ID: mdl-38267393

ABSTRACT

In molecular optimization, one popular way is R-group decoration on molecular scaffolds, and many efforts have been made to generate R-groups based on deep generative models. However, these methods mostly use information on known binding ligands, without fully utilizing target structure information. In this study, we proposed a new method, DiffDec, to involve 3D pocket constraints by a modified diffusion technique for optimizing molecules through molecular scaffold decoration. For end-to-end generation of R-groups with different sizes, we designed a novel fake atom mechanism. DiffDec was shown to be able to generate structure-aware R-groups with realistic geometric substructures by the analysis of bond angles and dihedral angles and simultaneously generate multiple R-groups for one scaffold on different growth anchors. The growth anchors could be provided by users or automatically determined by our model. DiffDec achieved R-group recovery rates of 69.67% and 45.34% in the single and multiple R-group decoration tasks, respectively, and these values were significantly higher than competing methods (37.33% and 26.85%). According to the molecular docking study, our decorated molecules obtained a better average binding affinity than baseline methods. The docking pose analysis revealed that DiffDec could decorate scaffolds with R-groups that exhibited improved binding affinities and more favorable interactions with the pocket. These results demonstrated the potential and applicability of DiffDec in real-world scaffold decoration for molecular optimization.

Subjects

Quantitative Structure-Activity Relationship , Molecular Docking Simulation

13.

Genome-wide association and Mendelian randomization analysis provide insights into the shared genetic architecture between high-dimensional electrocardiographic features and ischemic heart disease.

Wang, Xinfeng; Qi, Mengling; Zhang, Haoyang; Yang, Yuedong; Zhao, Huiying.

Hum Genet ; 143(1): 49-58, 2024 Jan.

Article in English | MEDLINE | ID: mdl-38180560

ABSTRACT

Observational studies have revealed that ischemic heart disease (IHD) has a unique manifestation on the electrocardiogram (ECG). However, the genetic relationships between IHD and ECG remain unclear. We took 12-lead ECG features as phenotypes to conduct genome-wide association studies (GWAS) for 41,960 samples from UK-Biobank (UKB). By leveraging large-scale GWAS summary statistics of ECG and IHD (downloaded from the FinnGen database), we performed LD score regression (LDSC), Mendelian randomization (MR), and polygenic risk score (PRS) regression to explore genetic relationships between IHD and ECG. Finally, we constructed an XGBoost model to predict IHD by integrating PRS and ECG. The GWAS identified 114 independent SNPs significantly (P value < 5 × 10⁻⁸/800, where 800 denotes the number of ECG features) associated with ECG. LDSC analysis indicated significant (P value < 0.05) genetic correlations between 39 ECG features and IHD. MR analysis performed by five approaches showed a putative causal effect of IHD on four S wave related ECG features at lead III. Integrating PRS for these ECG features with age and gender, the XGBoost model achieved an Area Under Curve (AUC) of 0.72 in predicting IHD. Here, we provide genetic evidence supporting S wave related ECG features at lead III to monitor the IHD risk, and open up a unique approach to integrate ECG with genetic factors for pre-warning IHD.

Subjects

Genome-Wide Association Study , Myocardial Ischemia , Humans , Mendelian Randomization Analysis/methods , Myocardial Ischemia/genetics , Polymorphism, Single Nucleotide , Phenotype
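
The significance threshold in the abstract above divides the conventional genome-wide level of 5 × 10⁻⁸ by the 800 ECG features tested, a Bonferroni-style correction. A short sketch of that arithmetic follows; the constant and function names are illustrative assumptions, not taken from the study's code.

```python
GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance level
N_ECG_FEATURES = 800      # number of ECG features, as stated in the abstract

# Bonferroni-style correction: divide alpha by the number of phenotypes tested.
threshold = GENOME_WIDE_ALPHA / N_ECG_FEATURES


def is_significant(p_value, alpha=GENOME_WIDE_ALPHA, n_tests=N_ECG_FEATURES):
    """Return True if p_value passes the corrected threshold alpha / n_tests."""
    return p_value < alpha / n_tests


print(f"{threshold:.3e}")      # 6.250e-11
print(is_significant(1e-11))   # True
print(is_significant(1e-10))   # False
```

A SNP with P = 1 × 10⁻¹⁰ would pass the usual single-trait genome-wide threshold but not the corrected one used here.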

14.

EVLncRNAs 3.0: an updated comprehensive database for manually curated functional long non-coding RNAs validated by low-throughput experiments.

Zhou, Bailing; Ji, Baohua; Shen, Congcong; Zhang, Xia; Yu, Xue; Huang, Pingping; Yu, Ru; Zhang, Hongmei; Dou, Xianghua; Chen, Qingshuai; Zeng, Qiangcheng; Wang, Xiaoxin; Cao, Zanxia; Hu, Guodong; Xu, Shicai; Zhao, Huiying; Yang, Yuedong; Zhou, Yaoqi; Wang, Jihua.

Nucleic Acids Res ; 52(D1): D98-D106, 2024 Jan 05.

Article in English | MEDLINE | ID: mdl-37953349

ABSTRACT

Long noncoding RNAs (lncRNAs) have emerged as crucial regulators across diverse biological processes and diseases. While high-throughput sequencing has enabled lncRNA discovery, functional characterization remains limited. The EVLncRNAs database is the first and exclusive repository for all experimentally validated functional lncRNAs from various species. After previous releases in 2018 and 2021, this update marks a major expansion through exhaustive manual curation of nearly 25 000 publications from 15 May 2020 to 15 May 2023. It incorporates substantial growth across all categories: a 154% increase in functional lncRNAs, 160% in associated diseases, 186% in lncRNA-disease associations, 235% in interactions, 138% in structures, 234% in circular RNAs, 235% in resistant lncRNAs and 4724% in exosomal lncRNAs. More importantly, it incorporates additional information, including functional classifications, detailed interaction pathways, homologous lncRNAs, lncRNA locations, and COVID-19, phase-separation and organoid-related lncRNAs. The web interface was substantially improved for browsing, visualization, and searching. ChatGPT was tested for information extraction and functional overview, with its limitations noted. EVLncRNAs 3.0 represents the most extensive curated resource of experimentally validated functional lncRNAs and will serve as an indispensable platform for unravelling emerging lncRNA functions. The updated database is freely available at https://www.sdklab-biophysics-dzu.net/EVLncRNAs3/.

Subjects

Databases, Nucleic Acid , RNA, Long Noncoding , Data Management , Information Storage and Retrieval , RNA, Long Noncoding/genetics

15.

Predicting the effects of mutations on protein solubility using graph convolution network and protein language model representation.

Wang, Jing; Chen, Sheng; Yuan, Qianmu; Chen, Jianwen; Li, Danping; Wang, Lei; Yang, Yuedong.

J Comput Chem ; 45(8): 436-445, 2024 Mar 30.

Article in English | MEDLINE | ID: mdl-37933773

ABSTRACT

Solubility is one of the most important properties of proteins. Protein solubility can be greatly changed by single amino acid mutations, and reduced protein solubility can lead to diseases. Since experimental methods to determine solubility are time-consuming and expensive, in-silico methods have been developed to predict the protein solubility changes caused by mutations, mostly through protein evolution information. However, these methods are slow since it takes a long time to obtain evolution information through multiple sequence alignment. In addition, these methods are of low performance because they do not fully utilize protein 3D structures due to a lack of experimental structures for most proteins. Here, we proposed a sequence-based method, DeepMutSol, to predict solubility change from residual mutations based on the Graph Convolutional Neural Network (GCN), where the protein graph was initiated according to the predicted protein structure from Alphafold2, and the nodes (residues) were represented by protein language embeddings. To circumvent the small data of solubility changes, we further pretrained the model over absolute protein solubility. DeepMutSol was shown to outperform state-of-the-art methods in benchmark tests. In addition, we applied the method to clinically relevant genes from the ClinVar database, and the predicted solubility changes were shown to be able to separate pathogenic mutations. All of the data sets and the source code are available at https://github.com/biomed-AI/DeepMutSol.

Subjects

Amino Acids , Benchmarking , Solubility , Mutation , Language

16.

Subgraph extraction and graph representation learning for single cell Hi-C imputation and clustering.

Zheng, Jiahao; Yang, Yuedong; Dai, Zhiming.

Brief Bioinform ; 25(1), 2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38040494

ABSTRACT

Single-cell Hi-C (scHi-C) technology enables the investigation of 3D chromatin structure variability across individual cells. However, the analysis of scHi-C data is challenged by a large number of missing values. Here, we present a scHi-C data imputation model HiC-SGL, based on Subgraph extraction and graph representation learning. HiC-SGL can also learn informative low-dimensional embeddings of cells. We demonstrate that our method surpasses existing methods in terms of imputation accuracy and clustering performance by various metrics.

Subjects

Chromatin , Chromatin/genetics , Cluster Analysis

17.

Jiang, Wei; Wang, Peng-Ying; Zhou, Qi; Lin, Qiu-Tong; Yao, Yao; Huang, Xun; Tan, Xiaoming; Yang, Shihui; Ye, Weicai; Yang, Yuedong; Bao, Yun-Juan.

J Transl Med ; 21(1): 885, 2023 Dec 06.

Article in English | MEDLINE | ID: mdl-38057859

ABSTRACT

BACKGROUND: With the development of cancer precision medicine, a huge amount of high-dimensional cancer information has rapidly accumulated regarding gene alterations, diseases, therapeutic interventions and various annotations. The information is highly fragmented across multiple different sources, making it highly challenging to effectively utilize and exchange the information. Therefore, it is essential to create a resource platform containing well-aggregated, carefully mined, and easily accessible data for effective knowledge sharing. METHODS: In this study, we have developed "Consensus Cancer Core" (Tri©DB), a new integrative cancer precision medicine knowledgebase and reporting system by mining and harmonizing multifaceted cancer data sources, and presenting them in a centralized platform with enhanced functionalities for accessibility, annotation and analysis. RESULTS: The knowledgebase provides the currently most comprehensive information on cancer precision medicine covering more than 40 annotation entities, many of which are novel and have never been explored previously. Tri©DB offers several unique features: (i) harmonizing the cancer-related information from more than 30 data sources into one integrative platform for easy access; (ii) utilizing a variety of data analysis and graphical tools for enhanced user interaction with the high-dimensional data; (iii) containing a newly developed reporting system for automated annotation and therapy matching for external patient genomic data. Benchmark test indicated that Tri©DB is able to annotate 46% more treatments than two officially recognized resources, oncoKB and MCG. Tri©DB was further shown to have achieved 94.9% concordance with administered treatments in a real clinical trial. 
CONCLUSIONS: The novel features and rich functionalities of the new platform will facilitate full access to cancer precision medicine data in one single platform and accommodate the needs of a broad range of researchers not only in translational medicine, but also in basic biomedical research. We believe that it will help to promote knowledge sharing in cancer precision medicine. Tri©DB is freely available at www.biomeddb.org , and is hosted on a cutting-edge technology architecture supporting all major browsers and mobile handsets.

Subjects

Neoplasms , Precision Medicine , Humans , Precision Medicine/methods , Genomics/methods , Neoplasms/genetics , Neoplasms/therapy , Knowledge Bases

18.

From intuition to AI: evolution of small molecule representations in drug discovery.

McGibbon, Miles; Shave, Steven; Dong, Jie; Gao, Yumiao; Houston, Douglas R; Xie, Jiancong; Yang, Yuedong; Schwaller, Philippe; Blay, Vincent.

Brief Bioinform ; 25(1) 2023 Nov 22.

Article in English | MEDLINE | ID: mdl-38033290

ABSTRACT

Within drug discovery, the goal of AI scientists and cheminformaticians is to help identify molecular starting points that will develop into safe and efficacious drugs while reducing costs, time and failure rates. To achieve this goal, it is crucial to represent molecules in a digital format that makes them machine-readable and facilitates the accurate prediction of properties that drive decision-making. Over the years, molecular representations have evolved from intuitive and human-readable formats to bespoke numerical descriptors and fingerprints, and now to learned representations that capture patterns and salient features across vast chemical spaces. Among these, sequence-based and graph-based representations of small molecules have become highly popular. However, each approach has strengths and weaknesses across dimensions such as generality, computational cost, inversibility for generative applications and interpretability, which can be critical in informing practitioners' decisions. As the drug discovery landscape evolves, opportunities for innovation continue to emerge. These include the creation of molecular representations for high-value, low-data regimes, the distillation of broader biological and chemical knowledge into novel learned representations and the modeling of up-and-coming therapeutic modalities.

Subjects

Drug Discovery, Intuition, Humans, Learning

19.

Effect of Chestnut (Castanea mollissima Blume) Bur Polyphenol Extract on Shigella dysenteriae: Antibacterial Activity and the Mechanism.

Peng, Fei; Chen, Linan; Wang, Xiuping; Yu, Zuoqing; Cheng, Caihong; Yang, Yuedong.

Molecules ; 28(19) 2023 Oct 09.

Article in English | MEDLINE | ID: mdl-37836834

ABSTRACT

Shigella dysenteriae is a highly pathogenic microorganism that can cause human bacillary dysentery by contaminating food and drinking water. This study investigated the antibacterial activity of chestnut bur polyphenol extract (CBPE) on S. dysenteriae and the underlying mechanism. The results showed that the minimum inhibitory concentration (MIC) of CBPE for S. dysenteriae was 0.4 mg/mL, and the minimum bactericidal concentration (MBC) was 1.6 mg/mL. CBPE treatment irreversibly disrupted cell morphology, decreased cell activity, and increased cell membrane permeability, cell membrane depolarization, and cell content leakage of S. dysenteriae, indicating that CBPE has obvious destructive effects on the cell membrane and cell wall of S. dysenteriae. Combined transcriptomic and metabolomics analysis revealed that CBPE inhibits S. dysenteriae by interfering with ABC protein transport, sulfur metabolism, purine metabolism, amino acid metabolism, glycerophospholipid metabolism, and some other pathways. These findings provide a theoretical basis for the prevention and treatment of S. dysenteriae infection with extract from chestnut burs.

Subjects

Bacillary Dysentery, Shigella dysenteriae, Humans, Polyphenols/pharmacology, Anti-Bacterial Agents/pharmacology, Bacillary Dysentery/microbiology, Plant Extracts/pharmacology

20.

Accurately identifying nucleic-acid-binding sites through geometric graph learning on language model predicted structures.

Song, Yidong; Yuan, Qianmu; Zhao, Huiying; Yang, Yuedong.

Brief Bioinform ; 24(6) 2023 Sep 22.

Article in English | MEDLINE | ID: mdl-37824738

ABSTRACT

The interactions between nucleic acids and proteins are important in diverse biological processes. The high-quality prediction of nucleic-acid-binding sites continues to pose a significant challenge. Presently, the predictive efficacy of sequence-based methods is constrained by their exclusive consideration of sequence context information, whereas structure-based methods are unsuitable for proteins lacking known tertiary structures. Though protein structures predicted by AlphaFold2 could be used, the extensive computing requirement of AlphaFold2 hinders its use for genome-wide applications. Based on the recent breakthrough of ESMFold for fast prediction of protein structures, we have developed GLMSite, which accurately identifies DNA- and RNA-binding sites using geometric graph learning on ESMFold-predicted structures. Here, the predicted protein structures are employed to construct a protein structural graph with residues as nodes and spatially neighboring residue pairs as edges. The node representations are further enhanced through the pre-trained language model ProtTrans. The network was trained using a geometric vector perceptron, and the geometric embeddings were subsequently fed into a common network to acquire common binding characteristics. Finally, these characteristics were input into two fully connected layers to predict binding sites with DNA and RNA, respectively. Through comprehensive tests on DNA/RNA benchmark datasets, GLMSite was shown to surpass the latest sequence-based methods and to be comparable with structure-based methods. Moreover, the prediction was shown to be useful for inferring nucleic-acid-binding proteins, demonstrating its potential for protein function discovery. The datasets, codes, and trained models are available at https://github.com/biomed-AI/nucleic-acid-binding.
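
The graph construction mentioned in this abstract (residues as nodes, spatially neighboring residue pairs as edges) can be illustrated with a minimal sketch. This is not the GLMSite code: the function name `build_residue_graph`, the use of one representative coordinate per residue, and the 10 Å distance cutoff are all assumptions made for illustration, since the abstract does not state how neighbors are defined.

```python
import math
from itertools import combinations


def build_residue_graph(coords, cutoff=10.0):
    """Return undirected edges (i, j) for residue pairs within `cutoff` angstroms.

    `coords` holds one (x, y, z) tuple per residue, e.g. a C-alpha position.
    The cutoff value is a hypothetical choice, not taken from the abstract.
    """
    edges = []
    for i, j in combinations(range(len(coords)), 2):
        # Connect two residues when their Euclidean distance is small enough.
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.append((i, j))
    return edges


# Toy coordinates: three residues close together and one far away.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (30.0, 0.0, 0.0),
]
toy_edges = build_residue_graph(toy_coords, cutoff=10.0)
print(toy_edges)
```

In this toy input the first three residues form a fully connected cluster, while the distant fourth residue has no edges. A real pipeline would read coordinates from a predicted structure file and attach node features to each residue.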

Subjects

Computer Neural Networks, Proteins, Binding Sites, Proteins/chemistry, RNA/metabolism, DNA, Language
